POLI 572B

Michael Weaver

March 8, 2021

Inference with Least Squares

Objectives

Recap

  • Least Squares as a predictive algorithm

Inference

  • Least Squares as an estimator of some unknown parameters

Today

Least squares as

  • algorithm for prediction
    • for “machine learning” to predict \(E(Y | X)\)
    • where we care about the accuracy of \(\widehat{Y}\) as a prediction of \(E(Y | X)\), not about the correctness of the slope estimates

Today

Least Squares

  • statistical model for inference
    • what is the statistical model? (like, e.g. Neyman Causal Model)
    • what are the assumptions we need for…
      • unbiasedness of estimator
      • deriving uncertainty of estimate

Recap

Least Squares for Prediction

So far, we…

  • …start with a vector \(\mathbf{Y} = (Y_1 \ldots Y_i \ldots Y_n)\)

  • …want to produce a vector of predicted values \(\mathbf{\widehat{Y}} = (\widehat{Y}_1 \ldots \widehat{Y}_i \ldots \widehat{Y}_n)\)

  • …define \(\mathbf{\widehat{Y}} = \mathbf{X}\mathbf{\beta}\):

    • where \(\mathbf{X}\) is a matrix of predictors (for the mean, just a vector of \(1\)s) and \(\mathbf{\beta}\) is a vector of linear coefficients that multiply the columns of \(\mathbf{X}\), so that \(\widehat{Y}_i = b_0x_{i0} + b_1x_{i1} + \ldots + b_px_{ip}\) (with \(x_{i0} = 1\) when there is an intercept)

Least Squares for Prediction

Then, we…

  • … choose \(\beta\) such that \(\mathbf{\widehat{Y}}\) has minimum (Euclidean) distance to \(\mathbf{Y}\):

    • minimize the squared length of \(\mathbf{e} = \mathbf{Y} - \mathbf{\widehat{Y}}\), the vector of residuals (prediction errors) \((Y_1 - \widehat{Y}_1, \ldots, Y_i - \widehat{Y}_i, \ldots, Y_n - \widehat{Y}_n)\)
  • the sum of squared residuals \(\mathbf{e}'\mathbf{e}\) is minimized when \(\mathbf{e}'\mathbf{X} = \mathbf{0}\): residuals are orthogonal to \(\mathbf{X}\).

This gave us \(\mathbf{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
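The algebra above can be sketched in a few lines of numpy (the slides themselves use R; the data below are simulated purely for illustration):

```python
import numpy as np

# A minimal sketch: compute the least-squares coefficients via the
# normal equations, beta = (X'X)^{-1} X'Y, and check that the
# residuals are orthogonal to X. All data here are simulated.
rng = np.random.default_rng(0)

n = 100
x = rng.normal(size=n)
Y = 2 + 3 * x + rng.normal(size=n)

# Design matrix: a column of 1s (intercept) plus the predictor.
X = np.column_stack([np.ones(n), x])

# Solve (X'X) beta = X'Y rather than inverting X'X directly.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

Yhat = X @ beta          # predicted values
e = Y - Yhat             # residuals

# Residuals are orthogonal to every column of X: e'X = 0.
print(e @ X)             # numerically ~ (0, 0)
```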

Least Squares for Prediction

This definition of least squares means that:

  • residuals \(\mathbf{e}\) are orthogonal to \(\mathbf{X}\); so if there is an intercept, the mean of the residuals is \(0\), and covariance of residuals and \(\mathbf{X}\) is \(0\).

  • If we just fit an intercept (same value of \(\widehat{Y}\) for all cases): what is the mean of the residuals \(E(Y_i - \widehat{Y_i})\)?
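For reference, a numpy sketch of the intercept-only case with simulated data: the fitted value is the mean of \(Y\), so the residuals average to exactly \(0\).

```python
import numpy as np

# Sketch: fitting only an intercept. The least-squares "prediction"
# is the sample mean of Y, so the residuals have mean exactly 0.
# Data are simulated for illustration.
rng = np.random.default_rng(1)
Y = rng.normal(loc=5, scale=2, size=50)

X = np.ones((50, 1))                        # just a column of 1s
beta = np.linalg.solve(X.T @ X, X.T @ Y)    # equals the mean of Y
e = Y - X @ beta

print(beta[0], Y.mean())   # identical
print(e.mean())            # 0 (up to floating point)
```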

Return to Multivariate

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Return to Multivariate

Let’s now find the linear conditional expectation function of earnings, across hours worked, age, profession, and gender:

\(Earnings_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i\)

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304486  -90963  -29238   72845  743690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11083.5     9484.6   1.169    0.243    
## UHRSWORK      1075.5      114.7   9.373   <2e-16 ***
## AGE           3452.8      125.2  27.578   <2e-16 ***
## LAW         -48919.9     2674.6 -18.291   <2e-16 ***
## FEMALE      -37223.1     2799.8 -13.295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared:  0.146,  Adjusted R-squared:  0.1457 
## F-statistic: 427.2 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on UHRSWORK?

Example

How could we change our regression equation to make the slope on hours worked easier to interpret?

Angrist and Pischke, MHE, pp. 77-80

Inference

Regression for Inference:

  1. We want to interpret the \(\beta\) vector causally.

  2. There is a true \(\beta_D\) (\(\delta\)) that is the effect of some \(D\) on \(Y\).

  3. \(\beta_D\) is an unknown parameter.

  4. To estimate \(\beta_D\), need a statistical model for least squares

A Statistical Model

As with experiments (e.g. the Neyman causal model), statistical models

  • treat observed data as realizations of random variables (some chance process)
  • with some assumptions…
  • permit us to connect what we observe to unobserved parameters

Need to clarify what work is being done by

  • data
  • regression mathematics
  • assumptions of statistical model

An example: Hooke’s Law

Hooke’s law is a law of physics that states that the force (\(D\)) needed to extend or compress a spring by some distance \(y\) increases linearly with respect to the distance.

Thus, for a given weight \(D_i\) on a spring, the length \(Y_i\) is determined by:

\(Y_i = \alpha + \beta D_i + \epsilon_i\) for \(i\) in \(1 \ldots n\)

An example: Hooke’s Law

\(Y_i = \alpha + \beta D_i + \epsilon_i\) for \(i\) in \(1 \ldots n\)

This is a statistical model:

  • \(\alpha\) and \(\beta\) are unknown parameters (thus, greek letters, not \(a\) and \(b\)). They are fixed.
    • \(\alpha\) is the length of spring in absence of force
    • \(\beta\) is the change in length of spring with unit of force
  • \(\epsilon_i\) is a random error (a random variable). Each \(\epsilon_i\) is…
    • independent of \(D_i\)
    • drawn from the same distribution with mean \(0\) and variance \(\sigma^2\) (another parameter)

We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)

In groups:

Imagine we have a spring whose length is 30 cm. With each Newton of force applied, the spring increases by 0.5 cm.

Using these facts and the statistical model described above: create a response schedule/potential outcomes table corresponding to the spring under loads of \(0,1,2,3,5,8\) Newtons.
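One way to tabulate this response schedule in code, as a sketch: the deterministic part follows directly from \(\alpha = 30\) and \(\beta = 0.5\); the error standard deviation \(\sigma = 0.1\) is an assumption chosen only for illustration.

```python
import numpy as np

# Response schedule for the spring example: Y_i = alpha + beta*D_i + eps_i.
# alpha = 30 cm, beta = 0.5 cm per Newton (from the prompt).
# sigma = 0.1 is an illustrative assumption, not given in the slides.
alpha, beta, sigma = 30.0, 0.5, 0.1
loads = np.array([0, 1, 2, 3, 5, 8])

expected_length = alpha + beta * loads   # deterministic part

rng = np.random.default_rng(2)
observed_length = expected_length + rng.normal(0, sigma, size=loads.size)

for d, mu, y in zip(loads, expected_length, observed_length):
    print(f"D = {d} N: E(Y) = {mu:.1f} cm, one realization Y = {y:.2f} cm")
```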

The Regression Model:

The statistical model assumes that this linear equation describes the process generating the data.

\[Y_i = \alpha + \beta D_i + \epsilon_i\]

deterministic (not random) \(\alpha + \beta D_i\)

stochastic (random error) \(\epsilon_i\)

The Regression Model:

\(\alpha\): parameter for the intercept (fixed for all cases)

\(\beta\): parameter for the slope (fixed for all cases)

\(D_i\): value of independent variable (input into the equation)

\(Y_i\): value of dependent variable (determined by the equation)

\(\epsilon_i\): random error for observation \(i\)

  • each \(\epsilon_i\) is realization of same random variable with mean \(0\) and variance \(\sigma^2\).
  • independent of each other and \(D_i\)

The Regression Model:

The more general regression model (where \(X_i\) is a matrix of variables)

\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]

can be estimated with least squares:

\[Y_i = X_i\hat{\beta} + e_i\]

where the residuals are used to estimate the variance of the errors: \(\widehat{Var}(e) = \widehat{\sigma^2}\)

The Regression Model:

Differences from Least Squares Algorithm

  • we now have estimates (denoted with hats: \(\widehat{\beta}\), \(\widehat{\sigma^2}\)) of the parameters
  • we now distinguish the unobserved errors \(\epsilon_i\) from the observed residuals \(e_i\).

Least Squares is the Best Linear Unbiased Estimator of our model parameters under 3 key assumptions.

  • by unbiased we mean \(\beta - E(\widehat{\beta}) = 0\)
  • “best” as in estimate has least variance (not so important in practice)

Model Assumptions:

\((1.)\) The model equation is correctly specified

  • correct functional form (e.g., linear) for process by which variables in \(X\) affect \(Y\), and how variables in \(X\) affect each other.
  • all relevant variables affecting \(Y\) (will define next week) are included in the model
  • \(Y_i = \mathbf{X_i}\mathbf{\beta} + \epsilon_i\) is the true data-generating function

In groups: come up with 2 ways this assumption could be wrong in the context of this model:

\(Earnings_i = \beta_0 + \beta_1 Hours_i + \beta_2 Age_i + \beta_3 Law_i + \beta_4 Female_i + \epsilon_i\)

Model Assumptions:

\((2.)\) \(\epsilon_i\) are independent, identically distributed (i.i.d.)

  • \(E(\epsilon_i) = 0\) and variance of \(\sigma^2\).
  • homoskedastic
    • errors \(\epsilon\) all come from same distribution with given variance
  • in contrast to heteroskedastic:
    • errors come from different distributions with different variances.

In groups: come up with 2 ways this assumption could be wrong in the context of the Hooke’s Law example.

Model Assumptions:

\((3.)\) \(\epsilon_i\) is independent of \(X_i\): \(X_i \perp \!\!\! \perp \epsilon_i\)

  • This means \(X\) is independent of \(\epsilon\) (recall dependence versus independence)
  • independence is not orthogonality

In groups: come up with 2 ways this assumption could be wrong in the context of this model:

\(Earnings_i = \beta_0 + \beta_1 Hours_i + \beta_2 Age_i + \beta_3 Law_i + \beta_4 Female_i + \epsilon_i\)

Model Assumptions:

Under these three assumptions, we call the model “ordinary least squares”.

We cannot empirically verify these assumptions.

In particular, cannot verify assumption 3:

  • When using least squares to estimate this model, we mathematically impose \(e_i\) to be orthogonal to \(x_i\) and \(E(e_i) = 0\). This is a mathematical feature of regression, not proof that \(X_i \perp \!\!\! \perp \epsilon_i\)

Model Assumptions:

If these assumptions are reasonable for the data we observe, then with large \(n\), we can estimate our parameters \(\beta, \sigma^2\) using least squares.

And \(\beta\) could have a causal interpretation.

Estimating the Regression Model

The Regression Model:

What work is done by

  • the data?
  • the math of regression?
  • the assumptions?

The Regression Model:

Mathematical Assumptions (least squares):

  • design matrix has full rank (no linear dependence between columns)
  • \(n \geq p\)

Statistical Assumptions (statistical model):

  • \(Y_i\) is generated by \(\mathbf{X_i\beta} + \epsilon_i\)
  • \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\)
  • \(X_i \perp \!\!\! \perp \epsilon_i\)

Estimation:

If \(\mathbf{X}\) is an \(n \times p\) matrix, we want to estimate \(p + 1\) parameters… (what are they?)

\(\beta\) estimated by \(\widehat{\beta}\) using least squares:

\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

  • Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. Why?

\(Y_i = X_i\beta + \epsilon_i\)

\(\epsilon_i\) is random variable.

If we were to repeat the drawing of \(\epsilon\) again, we would get different estimates: \(\widehat{\beta}\).

Estimation:

Under the statistical assumption we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)

  • “conditional on \(X\)” means taking the values of \(X\) as fixed

LS estimate is unbiased:

By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

LS estimate is unbiased:

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

So:

\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)

And because, by assumption, \(X \perp \!\!\! \perp \epsilon\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)

And because, by assumption, \(E(\epsilon) = 0\)

\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
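This unbiasedness result can be checked by simulation: holding \(X\) fixed and redrawing \(\epsilon\) many times, the average of the \(\widehat{\beta}\) draws should sit on top of the true \(\beta\). A numpy sketch with simulated data:

```python
import numpy as np

# Monte Carlo check of unbiasedness: fix the design X, redraw the
# errors many times, and average the least-squares estimates.
# All values are simulated for illustration.
rng = np.random.default_rng(3)

n = 50
beta_true = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design

estimates = []
for _ in range(2000):
    eps = rng.normal(0, 1, size=n)          # fresh error draw each time
    Y = X @ beta_true + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ Y))

print(np.mean(estimates, axis=0))  # ~ (1.0, 2.0)
```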

Variance of LS Estimator?

Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of causal parameter \(\beta\)

  • \(\widehat{\beta}\) is a random vector, as it depends on chance draws of \(\epsilon\)
  • As \(n \to \infty\), \(\widehat{\beta}\) converges to \(\beta\), but in any finite sample there is still uncertainty.

How do we discover the uncertainty in our estimate \(\widehat{\beta}\)?

Variance of our Estimate

Variance of OLS Estimator:

  • Each element of \(\widehat{\beta}_{p \times 1}\) is a random variable
    • each \(\widehat{\beta}_k\) has a variance
    • and a covariance with the other \(\widehat{\beta}_{j\neq k}\)
  • These variances and covariances are found in the variance-covariance matrix of \(\widehat{\beta}\)
    • a \(p \times p\) matrix
    • Diagonal elements are the variances for \(\widehat{\beta}_{1} \dots \widehat{\beta}_p\)
    • off-diagonal elements are covariances;
    • matrix is symmetric

\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Cov(\widehat{\beta}_1,\widehat{\beta}_3) &\ldots & Cov(\widehat{\beta}_1,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_2,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_3,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_3,\widehat{\beta}_p) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_p,\widehat{\beta}_1) & Cov(\widehat{\beta}_p,\widehat{\beta}_2) & Cov(\widehat{\beta}_p, \widehat{\beta}_3) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]

Variance of OLS Estimator:

How do we use the variance-covariance matrix?

  • The square-root of diagonal elements (variances) gives standard error for each parameter estimate in \(\widehat{\beta}\) (hypothesis testing)

  • The off-diagonal elements help answer compound questions, e.g. whether \(\beta_2 + \beta_3 \neq 0\): we need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3)\). (complex hypothesis testing, e.g. interaction effects)

Variance-Covariance Matrix

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:

\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

What really matters here is: \(E(\epsilon\epsilon'|X)\)

Variance-Covariance Matrix

All variance-covariance matrices are “sandwiches”:

\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the bread; \(E(\epsilon\epsilon'|X)\) is the meat.

Later on, we can make different choices of “meat”.

Variance-Covariance Matrix

In groups: if \(\epsilon\) is a vector that is \(n \times 1\) with elements \((\epsilon_1 \ldots \epsilon_n)\)

  1. What are the dimensions of the matrix: \(\epsilon\epsilon'\)?

  2. What are the elements on the diagonal of the matrix?

  3. What is the expected value of the elements on the off-diagonal?

Hints: We have assumed that \(E(\epsilon) = 0\); \(\epsilon_i\) are independent of each other

Variance-Covariance Matrix

\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]

\(\epsilon\epsilon'\) is \(n \times n\) matrix with \(ij\)th elements \(\epsilon_i\epsilon_j\)

  • this is the variance covariance matrix of the errors (we don’t subtract means because \(E(\epsilon) = 0\))

Variance-Covariance Matrix

Taking the expectation:

  • If \(i \neq j\), the elements of the matrix are \(E(\epsilon_i\epsilon_j)\). Because, by assumption, \(E(\epsilon) = 0\), and \(\epsilon\)s are independent, \(E(\epsilon_i\epsilon_j) = E(\epsilon_i)E(\epsilon_j) = 0 \times 0 = 0\)
  • but, for \(i = j\), \(E[\epsilon_i^2] = E[(\epsilon_i - E[\epsilon])^2]\) (a variance)
  • and because \(\epsilon\) is identically distributed, then \(Var(\epsilon_i) = Var(\epsilon) = \sigma^2\) for all \(\epsilon\)

The result is an \(n \times n\) matrix, with \(\sigma^2\) on the diagonal and \(0\) everywhere else.
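A small simulation confirms this: averaging \(\epsilon\epsilon'\) over many i.i.d. error draws approximates \(\sigma^2\mathbf{I}\). A numpy sketch (with \(n\) and \(\sigma\) chosen arbitrarily for illustration):

```python
import numpy as np

# Average eps @ eps' over many i.i.d. draws of the error vector:
# off-diagonal entries average to 0, diagonal entries to sigma^2.
rng = np.random.default_rng(6)
n, sigma = 4, 1.0

acc = np.zeros((n, n))
reps = 20000
for _ in range(reps):
    eps = rng.normal(0, sigma, size=n)
    acc += np.outer(eps, eps)   # the n x n matrix eps eps'

print(np.round(acc / reps, 2))  # ~ sigma^2 * identity
```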

Variance-Covariance Matrix

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

This gives us the variance-covariance matrix of \(\widehat{\beta}\). Can we start computing standard errors for \(\widehat{\beta}\) now?
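Before answering, the formula can be verified numerically: a numpy sketch comparing \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\) (computed with the true \(\sigma\), which in practice we do not know) to the simulated covariance of \(\widehat{\beta}\) across many redraws of the errors. All values are simulated for illustration.

```python
import numpy as np

# Check cov(beta_hat | X) = sigma^2 (X'X)^{-1} by simulation:
# hold X fixed, redraw the errors, and compare the empirical
# covariance of the estimates to the analytic formula.
rng = np.random.default_rng(4)

n, sigma = 200, 2.0
beta_true = np.array([1.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design

analytic = sigma**2 * np.linalg.inv(X.T @ X)

draws = []
for _ in range(5000):
    Y = X @ beta_true + rng.normal(0, sigma, size=n)
    draws.append(np.linalg.solve(X.T @ X, X.T @ Y))

simulated = np.cov(np.array(draws), rowvar=False)

print(analytic)
print(simulated)   # close to the analytic matrix
```

Note the sketch uses the true \(\sigma^2\); in practice it must be estimated, which is the next step.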

Variance-Covariance Matrix

Not yet: \(\sigma^2\) is an unknown parameter. We use the residuals \(e\) to stand in for the unobserved \(\epsilon\):

\[\widehat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\]

\(E[\widehat{\sigma^2} | X] = \sigma^2\) (unbiased)

  • Why \(n-p\)? The sample variance divides by \(n-1\) because we estimate the mean first, losing 1 degree of freedom. With bivariate regression we estimate 2 coefficients, so we lose 2 degrees of freedom; with \(p\) coefficients, we lose \(p\).

Variance-Covariance Matrix

so, we estimate the variance-covariance matrix:

\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)

Standard Errors of \(\widehat{\beta_p}\)

The variance (the \(p\)th diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is

\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{ip} - \overline{X_p^*})^2}\)

where \(X_p^*\) denotes the part of \(X_p\) not explained by the other predictors (its residuals after regressing \(X_p\) on the other columns of \(\mathbf{X}\)).

So the standard error is:

\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)

  • standard errors get smaller (more precise): with growing \(n\), increased (residual) variation in \(X_p\), smaller error variance (\(\sigma^2\))
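Putting the pieces together in a numpy sketch (simulated data; the true coefficients and error scale are assumptions chosen for illustration):

```python
import numpy as np

# From residuals to standard errors:
#   sigma2_hat = e'e / (n - p),
#   vcov_hat   = sigma2_hat * (X'X)^{-1},
#   SEs        = sqrt of the diagonal of vcov_hat.
rng = np.random.default_rng(5)

n, p = 100, 2
beta_true = np.array([1.0, 0.5])        # illustrative true values
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ beta_true + rng.normal(0, 1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat

sigma2_hat = (e @ e) / (n - p)                 # divide by n - p, not n
vcov = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated cov(beta_hat)
se = np.sqrt(np.diag(vcov))                    # standard errors

print(np.round(se, 3))
```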

Sampling Distribution of \(\hat{\beta}\)

asymptotic normality:

Even if \(\epsilon\) is not normally distributed, under our assumptions the sampling distribution of \(\hat{\beta}\) approaches a normal distribution as \(n\to\infty\), by the central limit theorem. This approximation is reasonable if \(n\) is moderately large and the data are not highly skewed.

Because we estimate the standard errors, the \(t\) statistic for hypothesis tests of \(\widehat{\beta}\) is asymptotically \(t\) distributed.

Standard errors

In sum:

  • We get \(\widehat{\sigma^2}\) by dividing the sum of squared residuals (\(e\)) by \(n - p\)
  • We take the diagonal of the variance-covariance matrix \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\) to get the variances of the coefficients
  • We take the square root of those variances to get standard errors for \(\widehat{\beta}\)

Standard errors

For standard errors of \(\widehat{\beta}\) to be correct, we must assume:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

If any assumption is wrong, the standard errors will be incorrect.

If assumption 1 or 3 is wrong, unbiasedness of \(\widehat{\beta}\) does not hold.

In groups:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

For each assumption: what is an example of it being wrong?

Why would it lead our standard errors to be wrong?

OLS Assumptions:

When do things go wrong?

  • We incorrectly specify the model (it is multiplicative, or we forgot a term):
    • \(Y_i = \alpha + \beta_1 x + \beta_2 x^2 + \epsilon\)
    • \(Y_i = \alpha + \beta_1 x + \beta_2 z + \epsilon\)
  • Errors are heteroskedastic: errors have different variances.
    • E.g.: at higher levels of education, incomes take on a wider range of values (think of PhD vs. MBA students’ earning potential), so the errors are larger.

OLS Assumptions:

  • Errors are not independent: errors for different observations are correlated.
    • E.g.: we observe the same counties over time. The error we observe for county \(i\) at time \(t\) is likely to be correlated with the error for county \(i\) at time \(t+1\), because it is the same place.
  • \(\epsilon\) is not independent of \(X\): the errors are in fact correlated with \(X\).
    • This is “confounding”. Our estimate \(\hat{\beta}\), the coefficient on \(x\), will be biased. If we incorrectly excluded \(z\) from the equation, the bias in the coefficient on \(x\) will be a function of the relationships between \(x\) and \(z\), and between \(z\) and \(y\).

Conclusion:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

With these assumptions:

  • \(\widehat{\beta}\) is unbiased estimator for \(\beta\)
  • known sampling distribution of \(\widehat{\beta}\)